Abstract

Feature selection techniques have long been the workhorse of biomarker discovery applications. Surprisingly,
the stability of feature selection with respect to sampling variations has long been under-considered. Only
recently has this issue received increasing attention. In this article, we review existing stable feature
selection methods for biomarker discovery using a generic hierarchical framework. We have two objectives:
(1) providing an overview of this new yet fast-growing topic for convenient reference; (2) categorizing
existing methods under an expandable framework for future research and development.
1 Introduction

Recent advances in genomics and proteomics enable the discovery of biomarkers for diagnosis and treatment
of complex diseases at the molecular level [1]. A biomarker may be defined as "a characteristic that is
objectively measured and evaluated as an indicator of normal biological processes, pathogenic processes, or
pharmacologic responses to a therapeutic intervention" [2].

The discovery of biomarkers from high-throughput "omics" data is typically modeled as selecting the most
discriminating features (or variables) for classification (e.g., discriminating healthy versus diseased, or different
tumor stages) [3]. In the language of statistics and machine learning, this is often referred to as feature
selection. Feature selection has attracted strong research interest in the past several decades. For recent reviews
of feature selection techniques used in bioinformatics, the reader is referred to [4, 5, 6, 3, 7].

While many feature selection algorithms have been proposed, they do not necessarily identify the same
candidate feature subsets if we repeat the biomarker discovery procedure [8]. Even for the same data, one
may find many different subsets of features (either from the same feature selection method or from different
feature selection methods) that can achieve the same or similar predictive accuracy [9, 10, 11]. In practice,
high reproducibility of feature selection is as important as high classification accuracy [12]. It is widely
believed that a study that cannot be repeated has little value [13]. Consequently, the instability of feature
selection results will reduce our confidence in discovered markers.

The stability issue in feature selection has received much attention recently. In this article, we review
existing methods for stable feature selection in biomarker discovery applications, summarize them within a
unified framework, and provide a convenient reference for future research and development.

This article differs from existing review papers on feature selection in the following aspects: 

• Compared to current feature selection reviews lEl HI [3] [7], this review focuses only on those feature
selection approaches that incorporate "stability" into the algorithmic design.

• This article mainly focuses on "methods" for finding reliable markers rather than "metrics" for measuring
the stability of selected feature subsets [14], although we also list these metrics for completeness.

The remainder of the paper is organized as follows. In section 2, we discuss several sources that cause the 
instability of feature selection. In section 3, we summarize available stable feature selection algorithms and 
describe different classes of methods in detail. In section 4, we provide a list of stability measures and illustrate 
their definitions. We give a discussion in section 5. Finally, we conclude this paper in section 6. 

2 Causes of Instability 

There are mainly three sources of instability in biomarker discovery: 

1. Algorithm design without considering stability: Classic feature selection methods aim at selecting a
minimum subset of features to construct a classifier of the best predictive accuracy [8]. They often ignore
"stability" in the algorithm design.

2. The existence of multiple sets of true markers: It is possible that there exist multiple sets of potential true
markers in real data. On the one hand, when there are many highly correlated features, different ones
may be selected under different settings [8]. On the other hand, even when there are no redundant features,
the existence of multiple non-correlated sets of real markers is also possible [15].

3. Small number of samples in high dimensional data: In the analysis of gene expression data and
proteomics data, there are typically only hundreds of samples but thousands of features. It has been
experimentally verified that the relatively small number of samples in high dimensional data is one of the main
sources of the instability problem in feature selection [16, 17]. To understand the nature of the instability
of selected feature subsets, Ein-Dor et al [18] developed a new mathematical model and concluded that at
least thousands of samples are needed to achieve stable feature selection.

Here we list three sources that can cause the instability of feature selection in biomarker discovery. We
believe that there may be still other sources that can affect the stability of feature selection. The identification
of these sources is of primary importance for future research and development. On the one hand, knowing
the reason enables us to better understand the problem. On the other hand, such knowledge will facilitate the
design of new methods for stable biomarker discovery.

Figure 1: A hierarchical framework for stable feature selection methods.

3 Existing Methods 

To date, there are many methods available for stable feature selection. We wish to cover all existing methods
in a systematic and expandable manner. Fig. 1 illustrates our approach to summarizing different methods based
on the way they treat different sources of instability. Briefly, the ensemble feature selection method and the
method using prior feature relevance incorporate stability considerations into the algorithm design stage. To
handle data with highly correlated features, the group feature selection approach treats feature clusters as the
basic unit in the selection process to increase robustness. The sample injection method tries to increase the
sample size to address the small-sample-size vs. large-feature-size issue. In the following sections, we will
discuss each category in detail.

3.1 Ensemble Feature Selection 

In statistics and machine learning, ensemble learning methods combine multiple learned models under the
assumption that "two (or more) heads are better than one". Typical ensemble learning methods such as bagging
[19] and boosting [20] have been widely used in classification and regression. Ensemble feature selection
techniques use a two-step procedure that is similar to ensemble classification and regression:

1. Creating a set of different feature selectors.

2. Aggregating the results of component selectors to generate the ensemble output.

The second step is typically modeled as a rank aggregation problem. Rank aggregation combines multiple
rankings into a consensus ranking, which has been widely studied in the context of web search engines [21]. In
most cases, the strategies rely on the following information:

• The ordinal rank associated with each feature. 

• The score assigned to each feature. 

To date, many rank aggregation approaches have been proposed and the reader is referred to [14] for a
survey of popular aggregation methods used in bioinformatics.
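
As a minimal sketch of the aggregation step, the following Python function combines ordinal ranks with a (weighted) sum, min, or max operation, which is the family this review calls linear combination. The function name, the `how` argument, and the optional `weights` are illustrative assumptions and do not reproduce the procedure of any particular cited method.

```python
def aggregate_ranks(rankings, how="sum", weights=None):
    """Combine several rankings into one consensus ordering.

    rankings: list of lists; rankings[s][j] is the rank (1 = best) that
              selector s assigns to feature j.
    how:      "sum", "min", or "max" (assumed names for the three operations).
    weights:  optional per-selector weights (hypothetical; default 1.0 each).
    Returns feature indices ordered from best to worst.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    combine = {"sum": sum, "min": min, "max": max}[how]
    n_features = len(rankings[0])
    combined = [
        combine(w * r[j] for w, r in zip(weights, rankings))
        for j in range(n_features)
    ]
    # A smaller combined rank means a better feature; ties keep index order.
    return sorted(range(n_features), key=lambda j: combined[j])


rankings = [[1, 2, 3], [3, 1, 2], [3, 2, 1]]
print(aggregate_ranks(rankings, how="sum"))
print(aggregate_ranks(rankings, how="max"))
```

Score-based aggregation would follow the same pattern with feature scores in place of ranks and the sort order reversed.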

Both theoretical and experimental results have suggested that the generation of a set of diverse component
learners is one of the keys to the success of ensemble learning [22]. To construct diverse local learners, two
strategies are widely used: data perturbation and function perturbation.

3.1.1 Data Perturbation 

Data perturbation tries to run component learners with different sample subsets (e.g., Bagging [19],
Boosting [20]) or in distinct feature subspaces (e.g., Random Subspace [23]). In ensemble feature selection with data
perturbation, different samplings of the original data are generated to construct different feature selectors, as
described in Fig. 2. Several recent methods [24, 25, 26, 27] fall into this category. These methods can be
further distinguished according to the sampling method, the component feature selection algorithm and the rank
aggregation method (see Table 1).

The combination of data sampling and ensemble learning for feature selection is probably the most intuitive
idea to handle selection instability with respect to sampling variation. The superiority of such a strategy has been
verified both experimentally [24, 27] and theoretically [25, 26].

Interestingly, all methods listed in Table 1 are based on the same aggregation scheme, i.e., linear combination.
Note that it is also feasible to combine data perturbation with more complicated aggregation procedures
such as those used in function perturbation (see next subsection).
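
The data perturbation scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: samples are perturbed with the bootstrap, the component selector is a simple signal-to-noise filter for two classes labeled 0 and 1, and ranks are aggregated by summation. The function names and parameters are hypothetical and are not taken from the methods listed in Table 1.

```python
import random
from statistics import mean, pstdev


def filter_scores(X, y):
    """Score each feature by a signal-to-noise ratio between classes 0 and 1."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        spread = pstdev(pos) + pstdev(neg) + 1e-12  # guard against zero spread
        scores.append(abs(mean(pos) - mean(neg)) / spread)
    return scores


def scores_to_ranks(scores):
    """Convert scores to ranks, with rank 1 for the highest score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0] * len(scores)
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def ensemble_select(X, y, n_selectors=50, top_k=1, seed=0):
    """Bootstrap the samples, rank features on each replicate, sum the ranks."""
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    rank_sum = [0] * m
    built = 0
    while built < n_selectors:
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:                        # both classes are required
            continue
        Xb = [X[i] for i in idx]
        for j, r in enumerate(scores_to_ranks(filter_scores(Xb, yb))):
            rank_sum[j] += r
        built += 1
    # Sum-of-ranks aggregation: the smallest totals win.
    return sorted(range(m), key=lambda j: rank_sum[j])[:top_k]
```

Replacing the bootstrap with random subsets of the samples, or the filter with another selector, changes only the two marked steps inside the loop.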

3.1.2 Function Perturbation 

Here we use function perturbation to refer to those ensemble feature selection methods in which the
component learners are different from each other. The basic idea is to capitalize on the strengths of different
algorithms to obtain robust feature subsets.

Function perturbation differs from data perturbation in two respects:

• It uses different feature selection algorithms rather than the same feature selection method.

• It typically conducts local feature selection on the original data (without sampling).

Figure 2: Ensemble feature selection using data perturbation. We first use different sub-samplings of training
data to select features and then build a consensus output with a rank aggregation method.

Existing ensemble feature selection methods in this category [30, 31, 32, 33] differ mainly in the aggregation
procedure:

• The distance synthesis method is used in [30].

• The Markov chain based rank aggregation method [21] is utilized in [31].

• The linear combination method is used in [32].

• The concept of stacking [34] is applied to the aggregation of feature selection results in [33].

Compared to data perturbation, function perturbation is less flexible since the ensemble scale is limited by
the number of available feature selection algorithms in the system. As a result, no more than four component
feature selectors are used in the ensemble learning process [30, 31, 32, 33].

3.2 Feature Selection with Prior Feature Relevance 

In most biomarker discovery applications, we typically assume that all features are equally relevant before
the selection procedure. In practice, some prior knowledge may be available to bias the selection towards some
features assumed to be more relevant [35, 36]. It has been shown that the use of prior knowledge on relevant
features induces a large gain in stability with improved classification performance [35].
Table 1: Classification of data perturbation based ensemble feature selection methods. Here linear combination
methods aggregate rankings using the (weighted) min, max, or sum operation. The filter method is a general
feature selection strategy that attempts to rank features solely according to their relevance to the target class.

Reference                        Sampling Method    Feature Selector    Aggregation Method
Davis et al [24]                 Random Subset      Filter Method       Linear Combination
Bach [25]                        Bootstrap          Lasso [28]          Linear Combination
Meinshausen and Bühlmann [26]    Random Subset      Randomized Lasso    Linear Combination
Abeel et al [27]                 Random Subset      SVM-RFE [29]        Linear Combination


Fig. 3 shows that there are several ways of obtaining this kind of prior knowledge. One feasible method
is to seek advice from domain experts or relevant publications. For instance, in gene expression data
classification, a biologist may know or guess that some genes are likely to be more relevant [35].

Another more interesting method is to obtain such prior knowledge from relevant data sets using transfer
learning [36]. Transfer learning focuses on extracting knowledge from a source task and applying it to a different
but related task [37]. In [36], those features that have been identified as markers from other data sets are
considered to be more relevant in the new feature selection task.
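
One simple way to picture how prior relevance can bias selection is to scale data-driven scores by a per-feature prior weight before ranking. The sketch below is an assumption made for illustration only: the multiplicative weighting, the function name, and the `prior` dictionary are hypothetical and are not the formulations used in [35] or [36].

```python
def select_with_prior(scores, prior=None, top_k=2):
    """Rank features by data-driven score times a prior relevance weight.

    scores: data-driven relevance score per feature (higher = better).
    prior:  dict mapping feature index -> weight > 1 for features believed
            relevant a priori (hypothetical encoding); others default to 1.0.
    """
    prior = prior or {}
    biased = [s * prior.get(j, 1.0) for j, s in enumerate(scores)]
    return sorted(range(len(scores)), key=lambda j: -biased[j])[:top_k]


scores = [0.50, 0.48, 0.10]
print(select_with_prior(scores, top_k=1))                  # no prior knowledge
print(select_with_prior(scores, prior={1: 1.5}, top_k=1))  # feature 1 favored
```

When two features have nearly equal data-driven scores, the prior acts as a tie-breaker that stays fixed across resampled data sets, which is the intuition behind the stability gain.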

Though prior knowledge is helpful in improving the stability of feature selection, the use of such information
has inherent limitations since biomarker discovery aims at finding new features rather than known ones.

3.3 Group Feature Selection 

One motivation for group feature selection is that groups of correlated features commonly exist in
high-dimensional data, and such groups are resistant to the variations of training samples [16]. If each feature group
is considered as a coherent single entity, potentially we may improve the selection stability.

Existing group feature selection algorithms follow the procedure described in Fig. 4. There are two key
steps: group formation and group transformation.

Group formation is the process of identifying groups of associated features. There are typically two classes
of methods for this purpose: knowledge-driven methods and data-driven methods. The knowledge-driven group
formation method utilizes domain knowledge to facilitate the generation of feature groups. For example, genes
normally function in co-regulated groups, making it feasible to search genes in the same pathway for group
identification. In contrast, the data-driven group formation method finds feature clusters using only information
contained in the input data.

Group transformation generates a single coherent representation for each feature group. The transformation
method can range from simple approaches like the feature value mean [38] to complicated methods such as
principal component analysis [39].
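
The two key steps can be sketched as follows. This is a minimal illustration rather than any of the cited algorithms: group formation is done here with a greedy correlation threshold (a simple stand-in for hierarchical clustering or k-means), and group transformation uses the feature value mean. The function names and the `threshold` parameter are assumptions.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb) if va > 0 and vb > 0 else 0.0


def form_groups(X, threshold=0.9):
    """Group formation: attach each feature to the first group whose seed
    feature it correlates with positively (r >= threshold), else open a new group."""
    columns = list(zip(*X))   # one tuple of values per feature
    groups = []               # each group is a list of feature indices
    for j, col in enumerate(columns):
        for group in groups:
            if pearson(col, columns[group[0]]) >= threshold:
                group.append(j)
                break
        else:
            groups.append([j])
    return groups


def transform_groups(X, groups):
    """Group transformation: represent each group by the mean of its members."""
    return [[sum(row[j] for j in g) / len(g) for g in groups] for row in X]


X = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2]]
groups = form_groups(X)
print(groups)
print(transform_groups(X, groups))
```

Any feature selection method can then be run on the transformed matrix, where each column stands for one feature group.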
Figure 3: Stable feature selection using prior knowledge on features. The prior knowledge on relevant features
is either obtained from domain experts or extracted from relevant data sets via transfer learning.

Figure 4: A generic group feature selection framework. In the first step, we identify feature groups using either
knowledge-driven methods or data-driven approaches. In the second step, we transform each feature group into
a single entity. Finally, we conduct feature selection in the transformed space.

Figure 5: An illustration of knowledge-driven group formation and feature transformation. The pathway
information is used to guide the search of correlated features (genes or proteins). Each identified feature group is
transformed into a new feature for further analysis.

In the following subsections, we will discuss existing group feature selection methods according to their 
group formation strategies. 

3.3.1 Knowledge-Driven Group Formation 

Recent advances in the construction of large protein networks make it feasible to find genes or proteins that
have coherent expression patterns in the same pathway¹. Using available protein-protein interaction (PPI)
networks, a number of approaches have been proposed to incorporate the pathway information into the biomarker
discovery procedure. As shown in Fig. 5, the basic idea is to find a group of associated genes or proteins from
the same pathway, and then transform this group into a new entity for subsequent feature selection and
classification. It has been shown that such knowledge-based methods are capable of achieving more reproducible
feature selection and higher accuracy [40].

We can further distinguish available methods in this category according to their target data: gene expression 
data and proteomics data. 

Before summarizing existing approaches on biomarker discovery using gene pathways, we would like to
discuss a closely related problem: gene set significance testing. Testing the statistical significance of gene
pathways or clusters has been extensively investigated. Well-known examples include the gene set enrichment
analysis [41] and the maxmean approach [42]. The reader is referred to [43, 44] for comprehensive reviews
of existing approaches on gene set analysis. Here we highlight the fact that gene set significance testing is
different from pathway-guided biomarker discovery since they have different objectives. The objective of gene
set significance testing is to find whether a given gene set satisfies the hypothesis, while biomarker discovery
aims at searching for a small subset of genes that can distinguish cases from controls as accurately as possible.
Their intrinsic connection is that we can utilize a pathway significance assessment method as a filter method (a
special type of feature selection technique that considers each entity individually) for ranking pathway markers.

¹ For simplicity, this paper uses the terms "gene ontology", "pathway", and "gene set" interchangeably, although they may not be
strictly equivalent.

Some methods have been proposed to identify markers not as individual genes but as gene sets
[38, 39, 40, 45, 46, 47, 48, 49, 50]. In Table 2, we provide a brief summary of existing biomarker identification
methods that use gene pathway information as prior knowledge. These methods exploit different strategies for group
generation and transformation. In group generation, we can use all genes in the pathway for a clear biological
interpretation. Alternatively, we can search for a subset of genes so as to obtain a more discriminating
group. To effectively represent each group, various summary statistics have been applied, ranging from the mean
to principal component analysis.

Table 2: Summary of gene pathway biomarker discovery methods. In the generation of gene groups, we either
accept all genes in a given pathway or use heuristic search methods (such as a greedy algorithm) to find a subset
of discriminating genes. Here "No transformation" means that we use all the genes in the group to represent this
gene set. GXNA (Gene eXpression Network Analysis) is a software package developed in [51] for identifying
a subset of differentially expressed genes from a given pathway.

Reference              Group Generation    Group Transformation
Guo et al [38]         Use all genes       Mean and median
Rapaport et al [39]    Use all genes       Principal component analysis
Chuang et al [40]      Greedy search       Sum of z-scores
Tai and Pan [45]       Use all genes       No transformation
Lee et al [46]         Greedy search       Sum of z-scores
Hwang and Park [47]    Greedy search       Mean
Yousef et al [48]      GXNA [51]           No transformation
Chen and Wang [49]     Use all genes       Principal component analysis
Su et al [50]          Use all genes       Sum of log-likelihood ratio


Recently, such knowledge-driven approaches have also been applied to proteomics data for biomarker
discovery at the level of protein groups [52, 53]. Compared with gene expression data, proteomics data have
received less attention in this direction, and more research effort is desirable.

The pathway-guided group formation method has the advantage that new transformed features are
biologically interpretable since the underlying disease process may be dependent on perturbations of different
pathways. Thus, prediction models based on pathways may approximate the true disease process more closely
than gene-based models [39]. Its main disadvantage is that we may group unrelated genes or proteins since the
reliability of the predicted interactions in PPI networks is still questionable [54].

3.3.2 Data-Driven Group Formation 

Instead of relying on prior knowledge of biology, the data-driven group formation method identifies feature
clusters using either cluster analysis [55, 56, 57, 58, 59, 60, 61, 62, 63] or density estimation [8, 16]. As
summarized in Table 3, clustering-based methods utilize popular partition algorithms such as hierarchical clustering
or k-means to generate feature groups. It should be noted that most existing clustering-based methods do not
explicitly consider the stability of feature groups. Alternatively, kernel density estimation is utilized in [8, 16]
based on the observation that dense core regions are stable with respect to samplings of dimensions.

Table 3: Summary of clustering-based group feature selection methods according to clustering algorithms and 
group transformation methods. 



Reference               Clustering Methods                           Group Transformation
Hastie et al [55]       Hierarchical clustering                      Mean
Jornsten and Yu [56]    Integrated clustering and group selection    Mean
Au et al [57]           k-modes                                      One most discriminating feature
Ma et al [58]           k-means                                      A subset of most discriminating features
Ma and Huang [59]       Hierarchical clustering/k-means              A subset of most discriminating features
Yousef et al [60]       k-means                                      No transformation
Park et al [61]         Hierarchical clustering                      Mean
Shin et al [62]         Hierarchical clustering                      One most discriminating feature
Tang et al [63]         Fuzzy k-means                                No transformation



There is another class of related methods that assign comparable coefficients to correlated, important
variables. The "elastic net" [64] is a typical example in this category. We omit these methods in this survey since
they do not explicitly identify feature groups.

The data-driven group feature selection method fully exploits the characteristics of target data so that it is
widely applicable. One main drawback is that it is not easy to interpret and validate the selected feature groups
biologically. One possible remedy is to use a hybrid strategy that combines the data-driven method with the
knowledge-driven method, as recently discussed in [65, 66].
3.4 Feature Selection with Sample Injection

In biomarker discovery applications, the number of features is typically larger than the sample size. This is 
one of the main sources of instability in feature selection. To increase the reproducibility of feature selection, 
one natural idea is to generate more samples. However, the generation of real sample data from patients and 
healthy people is usually expensive and time-consuming. With this practical limitation in mind, people begin 
to seek other alternative methods for the same purpose. 

From the viewpoint of data analysis, there are two data augmentation strategies: 

• Utilizing test data to increase the sample size in the feature selection process, which can be modeled as a
transductive learning problem [67].

• Generating some artificial training samples according to the distribution of available training data.

In the following sections, we will introduce each method in detail.

3.4.1 Method Using Transductive Learning 

Different from inductive learning algorithms, the transductive learning algorithm is not required to produce
a general hypothesis that can predict the label of any unobserved data [67]. As illustrated in Fig. 6, it is only
required to predict the labels of a given test set of samples. In other words, we can use both training data and
testing data in the learning procedure.

Transductive learning has been used to increase sample size in some recent papers [68, 69]. The main idea
is to take advantage of the information embedded in the test data so that the role of test samples is changed from
passive to active. That is, the unlabeled test samples are incorporated into the feature selection and classification
process.

3.4.2 Method Using Artificial Training Samples 

The idea is to generate a number of artificial training samples according to the distribution of given samples.
Then, feature subsets can be assessed using both the generated data and the original data. In Fig. 7, we provide
an example to illustrate the effect of injected artificial samples on model selection.

There are many methods for generating artificial training samples. For instance, we can first pick one
training sample $x_i$ randomly and then generate a point $z$ from the standard normal distribution. Finally, we
generate the new artificial point as $y = x_i + hz$, where $h$ is a constant.
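
The generation rule $y = x_i + hz$ translates directly into code. The sketch below assumes that each coordinate of $z$ is drawn independently from a standard normal distribution; the function name `inject_samples` and its parameters are illustrative.

```python
import random


def inject_samples(X, n_new, h=0.1, seed=0):
    """Generate n_new artificial points y = x_i + h * z.

    x_i is drawn uniformly from the training samples X and each coordinate
    of z comes from a standard normal distribution (assumed independent).
    """
    rng = random.Random(seed)
    artificial = []
    for _ in range(n_new):
        x = rng.choice(X)                              # pick x_i at random
        z = [rng.gauss(0.0, 1.0) for _ in x]           # one normal draw per feature
        artificial.append([xi + h * zi for xi, zi in zip(x, z)])
    return artificial


X = [[0.0, 1.0], [2.0, 3.0]]
print(inject_samples(X, n_new=3, h=0.1))
```

The constant $h$ controls how far the artificial points spread around the original samples; with $h = 0$ the procedure reduces to resampling the training set. An artificial point would normally inherit the class label of the sample it was generated from, although the text does not state this explicitly.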

There are two ways in which the artificial points participate in feature selection. One method is to treat the
injected points as original samples in the training process [70]. Another method is to use the injected points
only in the evaluation stage [71].
Figure 6: An illustration of inductive learning and transductive learning. Intuitively, we can consider inductive
learning as "an education system for all-round development" while transductive learning is an
"examination-oriented education system".

Figure 7: The effect of artificial training samples on model selection. The separating hyperplane obtained from
the data set with injected samples is different from that of the original samples.
4 Stability Measure

In stable feature selection, one important issue is how to measure the "stability" of feature selection
algorithms, i.e., how to quantify the selection sensitivity to variations in the training set. The stability measure can
be used in different contexts. On the one hand, it is indispensable for evaluating different algorithms in
performance comparison. On the other hand, it can be used for internal validation in feature selection algorithms that
take into account stability.

Since there is already a good review paper [14] on the stability of ranked gene lists, here we would
like to provide a more comprehensive list that includes evaluation methods from different domains.

Measuring stability requires a similarity measure for feature selection results. This depends on the
representations used by different algorithms. Formally, let training examples be described by a vector of features
$F = (f_1, f_2, \ldots, f_m)$; then there are three types of representation methods [72]:

• A subset of features: $S = \{s_1, s_2, \ldots, s_k\}$, $s_i \in \{f_1, f_2, \ldots, f_m\}$.

• A ranking vector: $R = (r_1, r_2, \ldots, r_m)$, $1 \le r_i \le m$.

• A weighting-score vector: $W = (w_1, w_2, \ldots, w_m)$, $w_i \in \mathbb{R}$.

In general, we are interested in stability measures that take more than two subsets (or rankings) into account. 
In this review, we use measures defined on two subsets (or rankings) for the sake of notation simplification. 
As pointed out in [14], there are essentially two approaches for generalizing the definition. One approach is 
to summarize pairwise stability measures through averaging. Another approach is to consider all subsets (or 
rankings) simultaneously in the specification of stability measure. 
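
The first generalization, averaging a pairwise measure over all pairs of results, can be sketched as follows. The Jaccard index is used here only as an example of a pairwise similarity; the function names are assumptions, and at least two subsets are required.

```python
from itertools import combinations


def jaccard(a, b):
    """Ratio of the intersection to the union of two feature subsets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def average_pairwise_stability(subsets, similarity=jaccard):
    """Average a pairwise similarity over all unordered pairs of subsets."""
    pairs = list(combinations(subsets, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


runs = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3}]
print(average_pairwise_stability(runs))
```

Any other pairwise measure for subsets, rankings, or weight vectors can be passed through the `similarity` argument.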

In the following, we will summarize available stability measures according to the representation of feature 
selection results. 

4.1 Feature Subset 

There is a wide variety of similarity measures available for the comparison of sets. Table 4 summarizes
available stability measures. One may find that most of these measures are defined using the physical properties
of two sets, e.g., the ratio of the intersection to the union. One exception is the "percentage of overlapping
features related" [13], which incorporates additional feature correlation information into the measure definition.
This is well suited to biomarker discovery applications since there are always some highly correlated
features in the "omics" data.

MS1: The relative Hamming distance between the masks corresponding to two subsets is used to measure
the stability [73].

MS2: The Tanimoto distance metric measures the amount of overlap between two sets of arbitrary
cardinality. It takes values in [0,1], with 0 meaning no overlap between the two sets and 1 meaning the two sets are
identical. In fact, this measure is equivalent to the Jaccard's index: $\frac{|S \cap S'|}{|S \cup S'|}$.
Table 4: A list of stability measures when the feature selection algorithm produces a subset of features as
output. Here S and S' are two subsets of features.

Index        Description                                   Formula
MS1 [73]     Relative Hamming distance                     $1 - \frac{|S| + |S'| - 2|S \cap S'|}{m}$
MS2 [72]     Tanimoto distance/Jaccard's index             $\frac{|S \cap S'|}{|S \cup S'|}$
MS3          Dice-Sorensen's index                         $\frac{2|S \cap S'|}{|S| + |S'|}$
MS4          Ochiai's index                                $\frac{|S \cap S'|}{\sqrt{|S||S'|}}$
MS5          Percentage of overlapping features            $\frac{|S \cap S'|}{|S|}$; $\frac{|S \cap S'|}{|S'|}$
MS6 [13]     Percentage of overlapping features related    $\frac{|S \cap S'| + c_{12}}{|S|}$; $\frac{|S \cap S'| + c_{21}}{|S'|}$
MS7          Kuncheva's stability measure                  $\frac{|S \cap S'| m - c^2}{c(m - c)}$
MS8 [17]     Consistency                                   $\frac{1}{|S \cup S'|} \sum_{f \in S \cup S'} \frac{freq(f) - 1}{m - 1}$
MS9          Weighted consistency                          $\sum_{f \in S \cup S'} \frac{freq(f)}{|S| + |S'|} \cdot \frac{freq(f) - 1}{m - 1}$
MS10 [24]    Length adjusted stability                     $\max\{0, \sum_{f \in S \cup S'} \frac{freq(f)}{2|S \cup S'|} - \alpha \frac{|S| + |S'|}{2m}\}$




MS3: The Dice-Sorensen's index is the harmonic mean of $\frac{|S \cap S'|}{|S|}$ and $\frac{|S \cap S'|}{|S'|}$.

MS4: The Ochiai's index is the geometric mean of $\frac{|S \cap S'|}{|S|}$ and $\frac{|S \cap S'|}{|S'|}$. It has been shown that the
performance of the Ochiai's index is similar to that of Jaccard's index and Dice-Sorensen's index ifTTI .

MS5: This measure was originally named "Percentage of Overlapping Genes (POG)" in the context of
gene expression data analysis.

MS6: It is an extension of POG, which incorporates highly correlated features between two sets into the
stability evaluation. In the formula, $c_{12}$ (or $c_{21}$) denotes the number of features in $S$ (or $S'$) that are not shared
but are significantly positively correlated with at least one feature in $S'$ (or $S$). The normalized form of this
measure is also presented in [13].

MS7: This stability measure assumes that $S$ and $S'$ have the same size (cardinality), i.e., $|S| = |S'| = c$.

MS8 and MS9: In both definitions, $freq(f)$ denotes the number of occurrences (frequency) of feature $f$
in $S \cup S'$. It has been proved that both measures take values in [0,1]. The (weighted) consistency value is 1 if
two sets are identical and 0 if they are disjoint.

MS10: In the formula, $\alpha$ is a user-specified parameter and is set to 10 [24]. Note that $(|S| + |S'|)/2$
corresponds to the median required in [24] since there are only two sets in our formulation.
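
Several of the set-based measures are simple enough to compute directly from the definitions above. The sketch below covers the Jaccard, Dice-Sorensen, Ochiai, POG, and Kuncheva measures for two non-empty subsets; the function name and the dictionary keys are illustrative choices, and Kuncheva's measure is returned only when both subsets have the same size $c$ with $0 < c < m$.

```python
from math import sqrt


def subset_measures(S, S2, m):
    """Pairwise stability of two non-empty feature subsets drawn from m features."""
    S, S2 = set(S), set(S2)
    inter, union = len(S & S2), len(S | S2)
    out = {
        "jaccard": inter / union,                     # MS2
        "dice": 2 * inter / (len(S) + len(S2)),       # MS3
        "ochiai": inter / sqrt(len(S) * len(S2)),     # MS4
        "pog": (inter / len(S), inter / len(S2)),     # MS5, one value per direction
    }
    if len(S) == len(S2) and 0 < len(S) < m:          # MS7 needs |S| = |S'| = c
        c = len(S)
        out["kuncheva"] = (inter * m - c * c) / (c * (m - c))
    return out


for name, value in subset_measures({1, 2, 3}, {2, 3, 4}, m=10).items():
    print(name, value)
```

Unlike the other four, Kuncheva's measure subtracts the overlap expected by chance, so it can be negative when two subsets share fewer features than random selection would give.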

4.2 Ranking List 

The problem of comparing ranking lists is widely studied in different contexts such as voting theory and
web document retrieval. Table 5 shows some distance measures for two ranking lists. One typical example is
MR2, in which the Spearman's correlation is adapted to place more weight on the top ranked features since
these features are more important than irrelevant features in the stability evaluation.



Table 5: A list of stability measures when the feature selection algorithm generates a ranking list as output.
Here R and R' are two different ranking lists.

Index       Description                                Formula
MR1 [72]    Spearman's rank correlation coefficient    $1 - \frac{6 \sum_{i=1}^{m} (r_i - r'_i)^2}{m(m^2 - 1)}$
MR2 [12]    Canberra distance                          $\sum_{i=1}^{m} \frac{|\min\{r_i, k+1\} - \min\{r'_i, k+1\}|}{\min\{r_i, k+1\} + \min\{r'_i, k+1\}}$
MR3         Overlap score                              $\sum_{i=1}^{m} e^{-\alpha i} |\{r_j \mid j \le i\} \cap \{r'_j \mid j \le i\}|$



MRl: The Spearman's rank correlation coefficient takes values in [-1,1], with 1 meaning that the two 
ranking lists are identical and a value of -1 meaning that they have exactly inverse orders. 

MR2: The Canberra distance is a weighted version of Spearman's footrule distance [72], i.e., $\sum_{i=1}^{m} |r_i - r'_i| / (r_i + r'_i)$.
Since the most important features are usually located at the top of the ranking list [72], the distance calculation
in the table only considers the top k ranked features.
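
The top-k restriction can be illustrated with a short sketch. We assume here that ranks are positive integers
starting at 1 and that every rank beyond k is clamped to k+1 before the Canberra terms are summed; the
function name is hypothetical.

```python
def canberra_topk(r, r_prime, k):
    """Canberra distance between two rank vectors, restricted to the top k.

    Ranks larger than k are clamped to k + 1, so disagreements among
    features that both lists place below the top k contribute nothing.
    """
    if len(r) != len(r_prime):
        raise ValueError("rank vectors must have equal length")
    total = 0.0
    for a, b in zip(r, r_prime):
        a_c, b_c = min(a, k + 1), min(b, k + 1)
        # Ranks start at 1, so the denominator is always positive.
        total += abs(a_c - b_c) / (a_c + b_c)
    return total


swap_at_top = canberra_topk([1, 2, 3, 4], [2, 1, 3, 4], k=2)
swap_at_bottom = canberra_topk([1, 2, 3, 4], [1, 2, 4, 3], k=2)
```

With k = 2, swapping the two top features yields a positive distance, whereas swapping the two bottom features
yields zero, which is the intended emphasis on the head of the list.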

MR3: The overlap score was originally proposed in [79], and here we follow [74] to reformulate it under the
assumption that only top-ranked features are important. In the formula, α is a user-specified parameter that
controls the decay rate.
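
A possible reading of MR3 is sketched below. We assume that each list is given as an ordered sequence of
feature identifiers (best first), and that the score sums, over depths i = 1, ..., m, the number of features
shared by the two top-i prefixes, weighted by exp(-αi). Both the input representation and the function name
are our assumptions for illustration.

```python
import math


def overlap_score(order, order_prime, alpha):
    """Exponentially weighted overlap of the top-i prefixes of two orderings.

    order and order_prime list the same feature identifiers, best first.
    A larger alpha makes the weights decay faster, so only agreement at
    the very top of the lists contributes noticeably.
    """
    if len(order) != len(order_prime):
        raise ValueError("orderings must have equal length")
    seen, seen_prime = set(), set()
    score = 0.0
    for i, (f, f_prime) in enumerate(zip(order, order_prime), start=1):
        seen.add(f)
        seen_prime.add(f_prime)
        # Size of the intersection of the two top-i feature sets.
        score += math.exp(-alpha * i) * len(seen & seen_prime)
    return score


same = overlap_score(["a", "b", "c"], ["a", "b", "c"], alpha=0.5)
flipped = overlap_score(["a", "b", "c"], ["c", "b", "a"], alpha=0.5)
```

Under these assumptions, two identical orderings score higher than an ordering and its reverse, and the gap is
driven mainly by the first few positions.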

4.3 Weighting-Score Vector 

The computational burden of combinatorial search for a feature subset can to some extent be alleviated by
using a feature weighting strategy [80]. Allowing feature weights to take real-valued numbers instead of binary
ones enables us to use well-established optimization techniques in algorithmic development. For instance, the
RELIEF algorithm [81] is one representative of such methods; it generates a weighting score vector
as output.

The weighting score vector is seldom used in defining stability measures. Table 6 lists one such measure,
MW1 [72]. Pearson's correlation coefficient ranges from -1 to 1. A value of 1 indicates a perfect positive
correlation, a value of 0 means that there is no correlation, while a value of -1 means that the two vectors are
anti-correlated. In the formula, μ_W and μ_W' are the means of the weight scores of W and W', respectively.
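
For concreteness, the following sketch evaluates Pearson's correlation coefficient between two weighting score
vectors using the standard definition (covariance divided by the product of the standard deviations). The
function name and the choice to raise an error for a constant vector are our own.

```python
import math


def pearson_correlation(w, w_prime):
    """Pearson's correlation between two equally long weight vectors."""
    n = len(w)
    if n != len(w_prime) or n < 2:
        raise ValueError("weight vectors must have equal length of at least 2")
    mu_w = sum(w) / n
    mu_wp = sum(w_prime) / n
    cov = sum((a - mu_w) * (b - mu_wp) for a, b in zip(w, w_prime))
    var_w = sum((a - mu_w) ** 2 for a in w)
    var_wp = sum((b - mu_wp) ** 2 for b in w_prime)
    denom = math.sqrt(var_w * var_wp)
    if denom == 0.0:
        # The coefficient is undefined when either vector is constant.
        raise ValueError("correlation undefined for a constant vector")
    return cov / denom


agree = pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
oppose = pearson_correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Proportional weight vectors give a coefficient of 1 and mirrored ones give -1, in line with the interpretation
above.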

Table 6: The Pearson's correlation coefficient measure. Here W and W' are two different weighting score
vectors.

Index     Description                        Formula
MW1 [72]  Pearson's correlation coefficient  $\frac{\sum_{i} (w_i - \mu_W)(w'_i - \mu_{W'})}{\sqrt{\sum_{i} (w_i - \mu_W)^2 \sum_{i} (w'_i - \mu_{W'})^2}}$

5 Discussions

We summarize three sources of instability for feature selection in Section 2. Among these sources, the small
number of samples in a high-dimensional feature space is probably the most difficult one in biomarker discovery.
Besides feature selection, other data analysis tasks also face the same challenge. Research progress in related
fields will facilitate the development of effective stable feature selection methods as well.

Group feature selection is the most extensively studied method among existing stable feature selection
approaches. This is because there are many correlated features in a high-dimensional space. However, such a
feature grouping strategy can only partially alleviate selection instability, since we still need to face the
reproducibility issue in the transformed space. In this regard, ensemble feature selection is probably more
promising as a general-purpose solution. One immediate hybrid strategy is to combine group feature selection
with ensemble feature selection, i.e., first perform feature grouping and then use ensemble feature selection in
the new feature space.
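
To make the hybrid strategy concrete, a toy sketch is given below. Every design choice in it is an assumption
made for illustration and is not taken from any cited method: features are grouped greedily by Pearson
correlation with a group seed, each group is represented by the average of its members, groups are scored by
the absolute difference of class means, and the ensemble aggregates bootstrap runs by selection frequency.

```python
import random
from statistics import mean


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def group_features(columns, threshold=0.9):
    """Greedy grouping: a feature joins the first group whose seed it correlates with."""
    groups = []
    for j, col in enumerate(columns):
        for g in groups:
            if pearson(columns[g[0]], col) >= threshold:
                g.append(j)
                break
        else:
            groups.append([j])
    return groups


def group_scores(columns, labels, groups, rows):
    """Score each group by the class-mean gap of its averaged feature on the given rows."""
    scores = []
    for g in groups:
        rep = [mean(columns[j][i] for j in g) for i in rows]
        pos = [v for v, i in zip(rep, rows) if labels[i] == 1]
        neg = [v for v, i in zip(rep, rows) if labels[i] == 0]
        scores.append(abs(mean(pos) - mean(neg)) if pos and neg else 0.0)
    return scores


def hybrid_select(columns, labels, k, n_boot=50, threshold=0.9, seed=0):
    """Group first, then keep the k groups most often ranked in the top k over bootstraps."""
    rng = random.Random(seed)
    groups = group_features(columns, threshold)
    n = len(labels)
    counts = [0] * len(groups)
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of row indices
        scores = group_scores(columns, labels, groups, rows)
        for g in sorted(range(len(groups)), key=lambda g: -scores[g])[:k]:
            counts[g] += 1
    ranked = sorted(range(len(groups)), key=lambda g: -counts[g])[:k]
    return [groups[g] for g in ranked]
```

Here columns is a list of feature columns (one list of sample values per feature) and labels holds 0/1 class
labels. The output is a list of feature groups rather than individual features, which reflects the point made
above that stability is then assessed in the transformed (grouped) space.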

The group feature selection strategy is only helpful when multiple sets of true markers arise from
the existence of redundant features. However, it is also possible that multiple sets of true markers share no
correlated features. The feature selection problem in this case is much harder than finding a minimal optimal
feature set for classification [82]. To our knowledge, there is still no method or measure that aims at
handling stability issues in this context. The general problem remains open and needs more research effort.

With respect to stability indices, most available measures are defined over feature subsets, since a feature
subset can be obtained from rankings or scores (but not vice versa). The major problem is that there is still
no consensus on the best stability measure. Therefore, a comprehensive comparison of existing stability
measures should be conducted in future research.

In fact, the biomarker discovery process involves many procedures. Here we only discuss feature selection
techniques for stable biomarker identification. The development of biomarker classifiers is also very important.
Readers are referred to a recent review [83] for research progress in this direction.

Finally, we would like to raise the following questions for future research in the pursuit of stable biomarker
discovery methods:

• How can we directly measure the stability of feature(s) without sampling the training data?

• Can we propose new methods that are capable of explicitly controlling the stability of the reported feature
subset?

• Are there other special requirements for biomarker discovery besides stability?

6 Conclusions



To discover reproducible markers from "omics" data, the stability issue of feature selection has received
much attention recently. This review summarizes existing stable feature selection methods and stability
measures. Stable feature selection is a very important research problem from both theoretical and
practical perspectives. More research effort should be devoted to this challenging topic.